Abstract. The longest common extension problem (LCE problem) is to
construct a data structure for an input string T of length n that supports
LCE(i, j) queries. Such a query returns the length of the longest common
prefix of the suffixes starting at positions i and j in T. This classic
problem has a well-known solution that uses O(n) space and O(1) query
time. In this paper we show that for any trade-off parameter 1 ≤ τ ≤ n,
the problem can be solved in O(n/τ) space and O(τ) query time. This
significantly improves the previously best known time-space trade-offs,
and almost matches the best known time-space product lower bound.


1 Introduction 

Given a string T, the longest common extension of suffixes i and j, denoted
LCE(i, j), is the length of the longest common prefix of the suffixes of T starting
at positions i and j. The longest common extension problem (LCE problem) is
to preprocess T into a compact data structure supporting fast longest common
extension queries.

The LCE problem is a basic primitive that appears as a central subproblem in 
a wide range of string matching problems such as approximate string matching 
and its variations [1,4,13,15,21], computing exact or approximate repetitions 
[6,14,18], and computing palindromes [12,19]. In many cases the LCE problem 
is the computational bottleneck. 

Here we study the time-space trade-offs for the LCE problem, that is, the
space used by the preprocessed data structure vs. the worst-case time used by
LCE queries. The input string is given in read-only memory and is not counted
in the space complexity. Throughout the paper we use ℓ as a shorthand for
LCE(i, j). The standard trade-offs are as follows: At one extreme we can store
a suffix tree combined with an efficient nearest common ancestor (NCA) data
structure [7,22]. This solution uses O(n) space and supports LCE queries in O(1)
time. At the other extreme we do not store any data structure and instead answer
queries simply by comparing characters from left to right in T. This solution uses
O(1) space and answers an LCE(i, j) query in O(ℓ) = O(n) time. Recently, Bille
et al. [2] presented a number of results. For a trade-off parameter τ, they gave: 1)
a deterministic solution with O(n/τ) space and O(τ^2) query time, 2) a randomized
Monte Carlo solution with O(n/τ) space and O(τ log(ℓ/τ)) = O(τ log(n/τ)) query
time, where all queries are correct with high probability, and 3) a randomized Las
Vegas solution with the same bounds as 2) but where all queries are guaranteed
to be correct. Bille et al. [2] also gave a lower bound showing that any data
structure for the LCE problem must have a time-space product of Ω(n) bits.
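The second extreme above, answering a query by direct comparison with no stored structure, can be sketched as follows. This is a minimal illustration that assumes 0-indexed Python strings; the function name `lce_naive` is hypothetical and not taken from the paper.

```python
def lce_naive(text: str, i: int, j: int) -> int:
    """Length of the longest common prefix of text[i:] and text[j:].

    Uses O(1) extra space and O(LCE(i, j)) time, as in the
    'no data structure' extreme of the trade-off.
    """
    n = len(text)
    length = 0
    # Compare characters left to right until a mismatch or the end of the text.
    while max(i, j) + length < n and text[i + length] == text[j + length]:
        length += 1
    return length
```

For example, on the string "abracadabra" the suffixes starting at 0-indexed positions 0 and 7 share the prefix "abra".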

Our Results Let τ be a trade-off parameter. We present four new solutions with
the following improved bounds. Unless otherwise noted the space bound is the
number of words on a standard RAM with logarithmic word size, not including
the input string, which is given in read-only memory.

- A deterministic solution with O(n/τ) space and O(τ log^2(n/τ)) query time.

- A randomized Monte Carlo solution with O(n/τ) space and O(τ) query time,
such that all queries are correct with high probability.

- A randomized Las Vegas solution with O(n/τ) space and O(τ) query time.

- A derandomized version of the Monte Carlo solution with O(n/τ) space and
O(τ) query time.

Hence, we obtain the first trade-off for the LCE problem with a linear time-space
product in the full range from constant to linear space. This almost matches the
time-space product lower bound of Ω(n) bits, and improves the best deterministic
upper bound by a factor of τ, and the best randomized bound by a factor log(n/τ).
See the columns marked Data Structure in Table 1 for a complete overview.

Table 1. Overview of solutions for the LCE problem. Here ℓ = LCE(i, j), ε > 0 is an
arbitrarily small constant and w.h.p. (with high probability) means with probability
at least 1 − n^{-c} for an arbitrarily large constant c. The data structure is correct if it
answers all LCE queries correctly.

While our main focus is the space and query time complexity, we also provide 
efficient preprocessing algorithms for building the data structures, supporting 
independent trade-offs between the preprocessing time and preprocessing space. 
See the columns marked Preprocessing in Table 1. 

To achieve our results we develop several new techniques and specialized data
structures which are likely of independent interest. For instance, in our
deterministic solution we develop a novel recursive decomposition of LCE queries and for
the randomized solution we develop a new sampling technique for Karp-Rabin
fingerprints that allows fast LCE queries. We also give a general technique for
efficiently derandomizing algorithms that rely on “few” or “short” Karp-Rabin
fingerprints, and apply the technique to derandomize our Monte Carlo
algorithm. To the best of our knowledge, this is the first derandomization technique
for Karp-Rabin fingerprints.

Preliminaries We assume an integer alphabet, i.e., T is chosen from some
alphabet Σ = {0, ..., n^c} for some constant c, so every character of T fits in O(1)
words. For integers a ≤ b, [a, b] denotes the range {a, a + 1, ..., b} and we define
[n] = [0, n − 1]. For a string S = S[1]S[2]...S[|S|] and positions 1 ≤ i ≤ j ≤ |S|,
S[i...j] = S[i]S[i + 1] ··· S[j] is a substring of length j − i + 1, S[i...] = S[i, |S|]
is the i-th suffix of S, and S[...i] = S[1, i] is the i-th prefix of S.

2 Deterministic Trade-Off 

Here we describe a completely deterministic trade-off for the LCE problem with
O((n/τ) log(n/τ)) space and O(τ log(n/τ)) query time for any τ ∈ [1, n]. Substituting
τ̂ = τ/log(n/τ), we obtain the bounds reflected in Table 1 for τ̂ ∈ [1/log n, n].

A key component in this solution is the following observation that allows us
to reduce an LCE(i, j) query on T to another query LCE(i', j') where i' and j'
are both indices in either the first or second half of T.

Observation 1 Let i, j and j' be indices of T, and suppose that LCE(j', j) ≥
LCE(i, j). Then LCE(i, j) = min(LCE(i, j'), LCE(j', j)).
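As a sanity check, the identity in Observation 1 can be tested exhaustively on a small string. The sketch below is only an illustration: it uses 0-indexed positions, a brute-force LCE, and the hypothetical helper names `lce` and `observation_holds`.

```python
def lce(text: str, i: int, j: int) -> int:
    """Brute-force LCE of the suffixes text[i:] and text[j:]."""
    k = 0
    while max(i, j) + k < len(text) and text[i + k] == text[j + k]:
        k += 1
    return k


def observation_holds(text: str) -> bool:
    """Check LCE(i, j) = min(LCE(i, j'), LCE(j', j)) whenever LCE(j', j) >= LCE(i, j)."""
    n = len(text)
    for i in range(n):
        for j in range(n):
            for jp in range(n):
                if lce(text, jp, j) >= lce(text, i, j):
                    expected = min(lce(text, i, jp), lce(text, jp, j))
                    if lce(text, i, j) != expected:
                        return False
    return True
```

The check runs in cubic time in the number of positions, so it is only meant for short test strings.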

We apply Observation 1 recursively to bring the indices of the initial query within
distance τ in O(log(n/τ)) rounds. We show how to implement each round with
a data structure using O(n/τ) space and O(τ) time. This leads to a solution
using O((n/τ) log(n/τ)) space and O(τ log(n/τ)) query time. Finally in Section 2.4, we
show how to efficiently solve the LCE problem for indices within distance τ in
O(n/τ) space and O(τ) time by exploiting periodicity properties of LCEs.

2.1 The Data Structure 

We will store several data structures, each responsible for a specific subinterval
I = [a, b] ⊆ [1, n] of positions of the input string T. Let I_left = [a, (a + b)/2],
I_right = ((a + b)/2, b], and |I| = b − a + 1. The task of the data structure for
I will be to reduce an LCE(i, j) query where i, j ∈ I to one where both indices
belong to either I_left or I_right.

The data structure stores information for O(|I|/τ) suffixes of T that start in
I_right. More specifically, we store information for the suffixes starting at positions
b − kτ ∈ I_right, k = 0, 1, ..., (|I|/2)/τ. We call these the sampled positions of
I_right. See Figure 1 for an illustration.

Fig. 1. Illustration of the contents of the data structure for the interval I = [a, b]. The
black dots are the sampled positions in I_right, and each such position has a pointer to
an index j'_k ∈ I_left.


For every sampled position b − kτ ∈ I_right, k = 0, 1, ..., (|I|/2)/τ, we store
the index j'_k of the suffix starting in I_left that achieves the maximum LCE value
with the suffix starting at the sampled position, i.e., T[b − kτ...] (ties broken
arbitrarily). Along with j'_k, we also store the value of the LCE between suffix
T[j'_k...] and T[b − kτ...]. Formally, j'_k and L_k are defined as follows:

\[ j'_k = \arg\max_{h \in I_{\mathrm{left}}} \mathrm{LCE}(h, b - k\tau) \quad\text{and}\quad L_k = \mathrm{LCE}(j'_k, b - k\tau). \]


Building the structure We construct the above data structure for the interval
[1, n], and build it recursively for [1, n/2] and (n/2, n], stopping when the length
of the interval becomes smaller than τ.

2.2 Answering a Query 

We now describe how to reduce a query LCE(i, j) where i, j ∈ I to one where
both indices are in either I_left or I_right. Suppose without loss of generality that
i ∈ I_left and j ∈ I_right. We start by comparing δ < τ pairs of characters of T,
starting with T[i] = T[j], until 1) we encounter a mismatch, 2) both positions
are in I_right, or 3) we reach a sampled position in I_right. It suffices to describe the
last case, in which T[i, i + δ] = T[j, j + δ], i + δ ∈ I_left and j + δ = b − kτ ∈ I_right
for some k. Then by Observation 1, we have that

\[
\begin{aligned}
\mathrm{LCE}(i, j) &= \delta + \mathrm{LCE}(i + \delta, j + \delta) \\
 &= \delta + \min(\mathrm{LCE}(i + \delta, j'_k), \mathrm{LCE}(j'_k, b - k\tau)) \\
 &= \delta + \min(\mathrm{LCE}(i + \delta, j'_k), L_k).
\end{aligned}
\]

Thus, we have reduced the original query to computing the query LCE(i + δ, j'_k),
in which both indices are in I_left.


2.3 Analysis 

Each round takes O(τ) time and halves the upper bound for |i − j|, which initially
is n. Thus, after O(τ log(n/τ)) time, the initial LCE query has been reduced to
one where |i − j| ≤ τ. At each of the O(log(n/τ)) levels, the number of sampled
positions is (n/2)/τ, so the total space used is O((n/τ) log(n/τ)).
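The recursive structure and the reduction round can be sketched as below. This is a rough illustration under several assumptions, not the construction analysed in the paper: positions are 0-indexed, the tables are built by brute force, intervals shorter than max(τ, 2) fall back to a naive scan that stands in for the Section 2.4 structure, and the names `build`, `query` and `lce_naive` are hypothetical.

```python
def lce_naive(text, i, j):
    """Direct comparison; stand-in for the nearby-indices structure."""
    k = 0
    while max(i, j) + k < len(text) and text[i + k] == text[j + k]:
        k += 1
    return k


def build(text, tau, a, b, tables):
    """Brute-force tables for the interval [a, b] (inclusive) and its sub-intervals."""
    if b - a + 1 < max(tau, 2):           # short interval: nothing is stored
        return
    mid = (a + b) // 2                    # I_left = [a, mid], I_right = [mid + 1, b]
    table = {}
    s = b
    while s > mid:                        # sampled positions b, b - tau, b - 2*tau, ...
        best = max(range(a, mid + 1), key=lambda h: lce_naive(text, h, s))
        table[s] = (best, lce_naive(text, best, s))   # the pair (j'_k, L_k)
        s -= tau
    tables[(a, b)] = table
    build(text, tau, a, mid, tables)
    build(text, tau, mid + 1, b, tables)


def query(text, tau, tables, a, b, i, j):
    """LCE(i, j) for i, j in [a, b], applying one reduction round per level."""
    if i == j:
        return len(text) - i
    if i > j:
        i, j = j, i
    if b - a + 1 < max(tau, 2):
        return lce_naive(text, i, j)
    mid = (a + b) // 2
    if j <= mid:                          # both indices already in I_left
        return query(text, tau, tables, a, mid, i, j)
    if i > mid:                           # both indices already in I_right
        return query(text, tau, tables, mid + 1, b, i, j)
    table = tables[(a, b)]
    delta = 0
    while True:
        if i + delta > mid:               # case 2: both positions are in I_right
            return delta + query(text, tau, tables, mid + 1, b, i + delta, j + delta)
        if j + delta in table:            # case 3: sampled position, use Observation 1
            jp, lk = table[j + delta]
            return delta + min(query(text, tau, tables, a, mid, i + delta, jp), lk)
        if text[i + delta] != text[j + delta]:   # case 1: mismatch
            return delta
        delta += 1
```

A typical use is to call `build(text, tau, 0, len(text) - 1, tables)` once with an empty dictionary and then answer queries with `query(text, tau, tables, 0, len(text) - 1, i, j)`.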

2.4 Queries with Nearby Indices 

We now describe the data structure used to answer a query LCE(i, j) when
|i − j| ≤ τ. We first give some necessary definitions and properties of periodic
strings. We say that the integer 1 ≤ p ≤ |S| is a period of a string S if any
two characters that are p positions apart in S match, i.e., S[i] = S[i + p] for all
positions i s.t. 1 ≤ i < i + p ≤ |S|. The following is a well-known property of
periods.

Lemma 1 (Fine and Wilf [5]). If a string S has periods a and b and |S| ≥
a + b − gcd(a, b), then gcd(a, b) is also a period of S.

The period of S is the smallest period of S and we denote it by per(S). If
per(S) ≤ |S|/2, we say S is periodic. A periodic string S might have many
periods smaller than |S|/2, however it follows from the above lemma that

Corollary 1. All periods smaller than |S|/2 are multiples of per(S).
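The definitions of period and per(S) can be illustrated with a small brute-force sketch. The helper names `periods` and `per` are hypothetical, and the quadratic-time enumeration is only meant for experimenting with short strings.

```python
def periods(s: str) -> list[int]:
    """All p in [1, |s|] such that s[k] == s[k + p] for every valid k."""
    n = len(s)
    return [p for p in range(1, n + 1)
            if all(s[k] == s[k + p] for k in range(n - p))]


def per(s: str) -> int:
    """The period of a non-empty string s, i.e., its smallest period."""
    return periods(s)[0]
```

For instance, "aabaabaa" has periods 3, 6, 7 and 8. Its smallest period 3 is at most half its length, so the string is periodic, and 3 is the only period not exceeding 4, in line with Corollary 1.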

The Data Structure Let T_k = T[kτ...(k + 2)τ − 1] denote the substring of
length 2τ starting at position kτ in T, k = 0, 1, ..., n/τ. For the strings T_k that
are periodic, let p_k = per(T_k) be the period. For every periodic T_k, the data
structure stores the length ℓ_k of the maximum substring starting at position kτ,
which has period p_k. Nothing is stored if T_k is aperiodic.


Answering a Query We may assume without loss of generality that i = kτ, for
some integer k. If not, then we check whether T[i + δ] = T[j + δ] until i + δ = kτ.
Hence, assume that i = kτ and j = i + d for some 0 < d ≤ τ. In O(τ) time, we
first check whether T[i + δ] = T[j + δ] for all δ ∈ [0, 2τ]. If we find a mismatch
we are done, and otherwise we return LCE(i, j) = ℓ_k − d.


Correctness If a mismatch is found when checking that T[i + δ] = T[j + δ]
for all δ ∈ [0, 2τ], the answer is clearly correct. Otherwise, we have established
that d ≤ τ is a period of T_k, so T_k is periodic and d is a multiple of p_k (by
Corollary 1). Consequently, T[i + δ] = T[j + δ] for all δ s.t. d + δ ≤ ℓ_k, and thus
LCE(i, j) = ℓ_k − d.
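The nearby-indices query can be sketched as follows. This is an illustration only: positions are 0-indexed, the values ℓ_k are computed by brute force rather than by the linear-time preprocessing of Section 2.5, blocks that do not fit entirely in the text are skipped, and the names `build_nearby`, `lce_nearby`, `per` and `lce_naive` are hypothetical.

```python
def lce_naive(text, i, j):
    """Reference LCE by direct comparison (used only for checking)."""
    k = 0
    while max(i, j) + k < len(text) and text[i + k] == text[j + k]:
        k += 1
    return k


def per(s):
    """Smallest period of a non-empty string s, by brute force."""
    n = len(s)
    return next(p for p in range(1, n + 1)
                if all(s[k] == s[k + p] for k in range(n - p)))


def build_nearby(text, tau):
    """Map k -> ell_k for every full-length periodic block T_k = text[k*tau : (k+2)*tau]."""
    n, ell = len(text), {}
    for k in range(n // tau):
        start = k * tau
        block = text[start:start + 2 * tau]
        if len(block) < 2 * tau:
            continue                      # truncated block near the end of the text
        p = per(block)
        if p <= tau:                      # T_k is periodic
            length = 2 * tau
            # Extend the run with period p as far to the right as possible.
            while start + length < n and text[start + length] == text[start + length - p]:
                length += 1
            ell[k] = length
    return ell


def lce_nearby(text, tau, ell, i, j):
    """LCE(i, j) for 0 < |i - j| <= tau using O(tau) character comparisons."""
    n = len(text)
    if i > j:
        i, j = j, i
    d = j - i
    matched = 0
    while i % tau != 0:                   # advance until i is a multiple of tau
        if j >= n or text[i] != text[j]:
            return matched
        i, j, matched = i + 1, j + 1, matched + 1
    for delta in range(2 * tau + 1):      # compare T[i + delta] with T[j + delta]
        if j + delta >= n or text[i + delta] != text[j + delta]:
            return matched + delta
    return matched + ell[i // tau] - d    # no mismatch: d is a period of T_k
```

The table returned by `build_nearby` has at most n/τ entries, which matches the space bound of the structure described above.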



2.5 Preprocessing 


The preprocessing of the data structure is split into the preprocessing of the
recursive data structure for queries where |i − j| > τ and the preprocessing of
the data structure for nearby indices, i.e., |i − j| ≤ τ.

If we allow O(n) space during the preprocessing phase then both can be
solved in O(n) time. If we insist on O(n/τ) space during preprocessing then the
times are not as good.


Recursive Data Structure Say O(n) space is allowed during preprocessing.
Generate a 2D range searching environment with a point (x(l), y(l)) for every
text index l ∈ [1, n]. Set x(l) = l and y(l) equal to the rank of suffix T[l...] in the
lexicographical sort of the suffixes. To compute j'_k ∈ I_left = [a, (a + b)/2] that has
the maximal LCE value with a sampled position i = b − kτ ∈ I_right = ((a + b)/2, b]
we require two 3-sided range successor queries, see detailed definitions in [17].
The first is a range [a, (a + b)/2] × (−∞, y(i) − 1], which returns the point within
the range with y-coordinate closest to, and less than, y(i). The second is a range
[a, (a + b)/2] × [y(i) + 1, ∞), which returns the point with y-coordinate closest to,
and greater than, y(i).

Let l' and l'' be the x-coordinates, i.e., text indices, of the points returned by
the queries. By definition both are in [a, (a + b)/2]. Among suffixes starting in
[a, (a + b)/2], T[l'...] has the largest rank in the lexicographic sort which is less
than y(i). Likewise, T[l''...] has the smallest rank in the lexicographic sort which is
greater than y(i). Hence, j'_k is equal to either l' or l'', and the larger of LCE(i, l')
and LCE(i, l'') determines which one.

The LCE computation can be implemented with standard suffix data
structures in O(n) space and O(1) query time for an overall O(n) time. 3-sided range
successor queries can be preprocessed in O(n log n) space and time and queries
can be answered in O(log log n) time, see [11,16] for an overall O(n log log n)
time and O(n log n) space.

If one desires to keep the space limited to O(n/τ) then generate a
sparse suffix array for the (evenly spaced) suffixes starting at kτ in I_right, k =
0, 1, ..., (|I|/2)/τ, see [9]. Now for every suffix in I_left find its location in the
sparse suffix array. From this data one can compute the desired answer. The
time to compute the sparse suffix array is O(n) and the space is O(n/τ). The
time to search a suffix in I_left is O(n + log(|I|/τ)) using the classical suffix search
in the suffix array [20]. The time per I is O(|I| n). Over all levels this yields
O(n · n + n · (n/2) + n · (n/4) + ··· + n · 1) = O(n^2) time.


The Nearby Indices Data Structure If T_k is periodic then we need to find its
period p_k and find the length ℓ_k of the maximum substring starting at position
kτ, which has period p_k.

Finding period p_k, if it exists, in T_k can be done in O(τ) time and constant
space using the algorithm of [3]. The overall time is O(n) and the overall space
is O(n/τ) for the p_k's.



To compute ℓ_k we check whether the period p_k of T_k extends to T_{k+2}. That
is, whether p_k is the period of T[kτ...(k + 4)τ − 1]. The following is why we do
so.


Lemma 2. Suppose T_k and T_{k+2} are periodic with periods p_k and p_{k+2},
respectively. If T_k · T_{k+2} is periodic with a period of length at most τ then p_k is the
period and p_{k+2} is a rotation of p_k.

Proof. To reach a contradiction, suppose that p_k is not the period of T_k · T_{k+2},
i.e., it has period p' of length at most τ. Since T_k is of size 2τ, p' must also be a
period of T_k. However, p_k and p' are both prefixes of T_k and, hence, must have
different lengths. By Corollary 1 |p'| must be a multiple of |p_k| and, hence, not
the period. The fact that p_{k+2} is a rotation of p_k can be deduced similarly. □

By induction it follows that T_k · T_{k+2} · T_{k+4} · ... · T_{k+2M}, M ≥ 1, is periodic
with p_k if each concatenated pair T_{k+2i} · T_{k+2(i+1)} is periodic with a period at
most τ. This suggests the following scheme. Generate a binary array of length
n/τ where location j is 1 iff T_{k+2(j−1)} · T_{k+2j} has a period of length at most
τ. In all places where there is a 0, i.e., T_k · T_{k+2} does not have a period of
length at most τ and T_k does have period p_k, we have that ℓ_k ∈ [2τ, 4τ − 1]. We
directly compute ℓ_k by extending p_k in T_k · T_{k+2} as far as possible. Now, using
the described array and the ℓ_k ∈ [2τ, 4τ − 1], in a single sweep for the even kτ's
(and a separate one for the odd ones) from right to left we can compute all ℓ_k's.
It is easy to verify that the overall time is O(n) and the space is O(n/τ).

3 Randomized Trade-Offs 

In this section we describe a randomized LCE data structure using O(n/τ) space
with O(τ + log(ℓ/τ)) query time. In Section 3.6 we describe another O(n/τ)-space
LCE data structure that either answers an LCE query in constant time, or
provides a certificate that ℓ ≤ τ^2. Combining the two data structures shows
that the LCE problem can be solved in O(n/τ) space and O(τ) time.

The randomization comes from our use of Karp-Rabin fingerprints [10] for 
comparing substrings of T for equality. Before describing the data structure, 
we start by briefly recapping the most important definitions and properties of 
Karp-Rabin fingerprints. 

3.1 Karp-Rabin Fingerprints 

For a prime p and x ∈ [p] the Karp-Rabin fingerprint [10], denoted φ_{p,x}(T[i...j]),
of the substring T[i...j] is defined as

\[ \phi_{p,x}(T[i...j]) = \sum_{i \le k \le j} T[k]\, x^{k-i} \bmod p. \]


If T[i...j] = T[i'...j'] then clearly φ_{p,x}(T[i...j]) = φ_{p,x}(T[i'...j']). In the Monte
Carlo and the Las Vegas algorithms we present we will choose p such that p =
Θ(n^{4+c}) for some constant c > 0 and x uniformly from [p] \ {0}. In this case a
simple union bound shows that the converse is also true with high probability,
i.e., φ is collision-free on all substring pairs of T with probability at least 1 − n^{-c}.
Storing a fingerprint requires O(1) space. When p, x are clear from the context
we write φ = φ_{p,x}.

For shorthand we write f(i) = φ(T[1, i]), i ∈ [1, n], for the fingerprint of the
i-th prefix of T. Assuming that we store the exponent x^i mod p along with the
fingerprint f(i), the following two properties of fingerprints are well-known and
easy to show.

Lemma 3. 1) Given f(i), the fingerprint f(i ± a) for some integer a can be
computed in O(a) time. 2) Given fingerprints f(i) and f(j), the fingerprint
φ(T[i..j]) can be computed in O(1) time.

In particular this implies that for a fixed length l, the fingerprints of all substrings
of length l of T can be enumerated in O(n) time using a sliding window.
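The second part of Lemma 3 can be illustrated with prefix fingerprints stored for every position. The sketch below is not the sampled structure of this paper: it keeps all n prefix fingerprints, uses 0-indexed positions, stores inverse powers of x instead of x^i, and relies on Python 3.8+ for `pow(x, -1, p)`. The class name `PrefixFingerprints` and the constant `P` (the Mersenne prime 2^61 − 1) are illustrative assumptions.

```python
P = (1 << 61) - 1  # a Mersenne prime, used here only as an example modulus


class PrefixFingerprints:
    """Karp-Rabin prefix fingerprints f(i) = sum_{k < i} T[k] * x^k mod p."""

    def __init__(self, text: str, p: int, x: int):
        self.p = p
        self.f = [0]          # f[i] is the fingerprint of text[0:i]
        self.inv_pow = [1]    # inv_pow[i] = x^(-i) mod p
        x_inv = pow(x, -1, p)
        power = 1             # x^k mod p for the current k
        for ch in text:
            self.f.append((self.f[-1] + ord(ch) * power) % p)
            power = power * x % p
            self.inv_pow.append(self.inv_pow[-1] * x_inv % p)

    def substring(self, i: int, j: int) -> int:
        """Fingerprint of text[i..j] (inclusive), computed in O(1) time."""
        return (self.f[j + 1] - self.f[i]) * self.inv_pow[i] % self.p
```

Equal substrings always receive equal fingerprints; distinct substrings collide only with small probability when x is chosen at random.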


3.2 Overview 

The main idea in our solution is to binary search for the LCE(i, j) value using
Karp-Rabin fingerprints. Suppose for instance that φ(T[i, i + M]) ≠ φ(T[j, j + M])
for some integer M, then we know that LCE(i, j) ≤ M, and thus we can
find the true LCE(i, j) value by comparing log(M) additional pairs of fingerprints.
The challenge is to obtain the fingerprints quickly when we are only allowed to
use O(n/τ) space. We will partition the input string T into n/τ blocks each of
length τ. Within each block we sample a number of equally spaced positions.
The data structure consists of the fingerprints of the prefixes of T that end
at the sampled positions, i.e., we store f(i) for all sampled positions i. In total
we sample O(n/τ) positions. If we just sampled a single position in each block
(similar to the approach in [2]), we could compute the fingerprint of any substring
in O(τ) time (see Lemma 3), and the above binary search algorithm would take
O(τ log n) time. We present a new sampling technique that only samples an
additional O(n/τ) positions, while improving the query time to O(τ + log(ℓ/τ)).


Preliminary definitions We partition the input string T into n/τ blocks of τ
positions, and by block k we refer to the positions [kτ, kτ + τ), for k ∈ [n/τ].

We assume without loss of generality that n and τ are both powers of two.
Every position q ∈ [1, n] can be represented as a bit string of length lg n. Let
q ∈ [1, n] and consider the binary representation of q. We define the leftmost
lg(n/τ) bits and rightmost lg(τ) bits to be the head, denoted h(q), and the tail,
denoted t(q), respectively. A position is block aligned if t(q) = 0. The significance
of q, denoted s(q), is the number of trailing zeros in h(q). Note that the τ
positions in any fixed block k ∈ [n/τ] all have the same head, and thus also the
same significance, which we denote by μ_k. See Figure 2.



Fig. 2. Example of the definitions for the position q = 205035 in a string of length
n = 2^19 with block length τ = 2^8. Here h(q) is the first lg(n/τ) = 11 bits, and t(q) is
the last lg(τ) = 8 bits in the binary representation of q. The significance is s(q) = 5.
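The head, tail and significance of a position can be computed with a few bit operations. The following sketch assumes that n and τ are powers of two and are passed as their logarithms `lg_n` and `lg_tau`; the function names are hypothetical.

```python
def head(q: int, lg_tau: int) -> int:
    """The leftmost lg(n/tau) bits of q, i.e., the index of the block containing q."""
    return q >> lg_tau


def tail(q: int, lg_tau: int) -> int:
    """The rightmost lg(tau) bits of q, i.e., the offset of q inside its block."""
    return q & ((1 << lg_tau) - 1)


def significance(q: int, lg_n: int, lg_tau: int) -> int:
    """Number of trailing zeros in the lg(n/tau)-bit head of q."""
    h = head(q, lg_tau)
    if h == 0:
        return lg_n - lg_tau              # all head bits are zero
    return (h & -h).bit_length() - 1      # index of the lowest set bit
```

With the values from Figure 2 (q = 205035, lg n = 19, lg τ = 8) the head is 800, the tail is 235, and the significance is 5, since 800 = 2^5 · 25.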


3.3 The Monte Carlo Data Structure 

The data structure consists of the values f(i), i ∈ S, for a specific set of sampled
positions S ⊆ [1, n], along with the information necessary in order to look up
the values in constant time. We now explain how to construct the set S. In
block k ∈ [n/τ] we will sample b_k = min{2^{⌊μ_k/2⌋}, τ} evenly spaced positions,
where μ_k is the significance of the positions in block k, i.e., μ_k = s(kτ). More
precisely, in block k we sample the positions B_k = {kτ + jτ/b_k | j ∈ [b_k]}, and
let S = ∪_{k ∈ [n/τ]} B_k. See Figure 3.

We now bound the size of S. The significance of a block is at most lg(n/τ),
and there are at most 2^{lg(n/τ) − μ} blocks with significance μ, so

\[ |S| = \sum_{k \in [n/\tau]} b_k \le \sum_{\mu=0}^{\lg(n/\tau)} 2^{\lg(n/\tau) - \mu} \, 2^{\lfloor \mu/2 \rfloor} \le \frac{n}{\tau} \sum_{\mu=0}^{\infty} 2^{-\mu/2} = O(n/\tau). \]

Fig. 3. Illustration of a string T partitioned into 16 blocks each of length τ. The
significance μ_k for the positions in each block k ∈ [n/τ] is shown, as well as the b_k
values. The black dots are the sampled positions S.
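The sampling rule can be tried out on small inputs with the sketch below. It assumes 0-indexed positions and that n and τ are powers of two; the function names `significance` and `sampled_positions` are hypothetical.

```python
def significance(q: int, lg_n: int, lg_tau: int) -> int:
    """Number of trailing zeros in the lg(n/tau)-bit head of position q."""
    h = q >> lg_tau
    if h == 0:
        return lg_n - lg_tau
    return (h & -h).bit_length() - 1


def sampled_positions(n: int, tau: int) -> list[int]:
    """The set S: b_k = min(2^floor(mu_k / 2), tau) evenly spaced positions per block k."""
    lg_n, lg_tau = n.bit_length() - 1, tau.bit_length() - 1
    samples = []
    for k in range(n // tau):
        mu = significance(k * tau, lg_n, lg_tau)
        b = min(2 ** (mu // 2), tau)
        step = tau // b                   # an integer, since b and tau are powers of two
        samples.extend(k * tau + j * step for j in range(b))
    return samples
```

For 16 blocks of length 8 (n = 128) this yields 22 sampled positions, and every block start is sampled, so the total stays within a constant factor of n/τ.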



3.4 Answering a query 

We now describe how to answer an LCE(i, j) query. We will assume that i is
block aligned, i.e., i = kτ for some k ∈ [n/τ]. Note that we can always obtain this
situation in O(τ) time by initially comparing at most τ − 1 pairs of characters
of the input string directly.

Algorithm 1 shows the query algorithm. It performs an exponential search to
locate the block in which the first mismatch occurs, after which it scans the block
directly to locate the mismatch. The search is performed by calls to check(i, j, c),
which computes and compares φ(T[i...i + c]) and φ(T[j...j + c]). In other words,
assuming that φ is collision-free, check(i, j, c) returns true if LCE(i, j) ≥ c and
false otherwise.


Algorithm 1 Computing the answer to a query LCE(i, j)

1: procedure LCE(i, j)
2:     ℓ̂ ← 0
3:     μ ← 0
4:     while check(i, j, 2^μ τ) do    ▷ Compute an interval such that ℓ ∈ [ℓ̂, 2ℓ̂]
5:         (i, j, ℓ̂) ← (i + 2^μ τ, j + 2^μ τ, ℓ̂ + 2^μ τ)
6:         if s(j) > μ then
7:             μ ← μ + 1
8:     while μ > 0 do    ▷ Identify the block in which the first mismatch occurs
9:         if check(i, j, 2^{μ−1} τ) then
10:            (i, j, ℓ̂) ← (i + 2^{μ−1} τ, j + 2^{μ−1} τ, ℓ̂ + 2^{μ−1} τ)
11:        μ ← μ − 1
12:    while T[i] = T[j] do    ▷ Scan the final block left to right to find the mismatch
13:        (i, j, ℓ̂) ← (i + 1, j + 1, ℓ̂ + 1)
14:    return ℓ̂
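The control flow of Algorithm 1 can be sketched in Python as below. This illustration makes several simplifying assumptions: positions are 0-indexed, `check` compares the c characters directly as a stand-in for the fingerprint comparison (so it shows the search logic, not the running time), the initial alignment of i to a block boundary is included, and the names `lce_query`, `check` and `lce_naive` are hypothetical.

```python
def lce_naive(text, i, j):
    """Reference LCE by direct comparison (used only for checking)."""
    k = 0
    while max(i, j) + k < len(text) and text[i + k] == text[j + k]:
        k += 1
    return k


def lce_query(text: str, tau: int, i: int, j: int) -> int:
    n = len(text)
    lg_tau = tau.bit_length() - 1
    width = max(n.bit_length() - 1 - lg_tau, 0)   # number of head bits

    def significance(q):
        h = q >> lg_tau
        return width if h == 0 else (h & -h).bit_length() - 1

    def check(i, j, c):
        # Stand-in for comparing the fingerprints of the two length-c substrings.
        return max(i, j) + c <= n and text[i:i + c] == text[j:j + c]

    ell = 0
    while i % tau != 0:                   # make i block aligned
        if max(i, j) >= n or text[i] != text[j]:
            return ell
        i, j, ell = i + 1, j + 1, ell + 1
    mu = 0
    while check(i, j, (1 << mu) * tau):   # exponential search (lines 4-7)
        step = (1 << mu) * tau
        i, j, ell = i + step, j + step, ell + step
        if significance(j) > mu:
            mu += 1
    while mu > 0:                         # narrow down to one block (lines 8-11)
        step = (1 << (mu - 1)) * tau
        if check(i, j, step):
            i, j, ell = i + step, j + step, ell + step
        mu -= 1
    while max(i, j) < n and text[i] == text[j]:   # final scan (lines 12-13)
        i, j, ell = i + 1, j + 1, ell + 1
    return ell
```

In the actual data structure the call to `check` would be answered from the sampled prefix fingerprints, which is what gives the query time analysed below.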


Analysis We now prove that Algorithm 1 correctly computes ℓ = LCE(i, j) in
O(τ + log(ℓ/τ)) time. The algorithm is correct assuming that check(i, j, 2^μ τ)
always returns the correct answer, which will be the case if φ is collision-free.
The following is the key lemma we need to bound the time complexity.

Lemma 4. Throughout Algorithm 1 it holds that ℓ ≥ (2^μ − 1)τ, s(j) ≥ μ, and
μ is increased in at least every second iteration of the first while-loop.

Proof. We first prove that s(j) ≥ μ. The claim holds initially. In the first loop j
is changed to j + 2^μ τ, and s(j + 2^μ τ) ≥ min{s(j), s(2^μ τ)} = min{s(j), μ} = μ,
where the last equality follows from the induction hypothesis s(j) ≥ μ. Moreover,
μ is only incremented when s(j) > μ. In the second loop j is changed to j + 2^{μ−1} τ,
which under the assumption that s(j) ≥ μ, has significance s(j + 2^{μ−1} τ) = μ − 1.
Hence the invariant is restored when μ is decremented at line 11.

Now consider an iteration of the first loop where μ is not incremented, i.e.,
s(j) = μ. Then h(j)/2^μ is an odd integer, i.e., h(j + 2^μ τ)/2^μ is even, and hence
s(j + 2^μ τ) > μ, so μ will be incremented in the next iteration of the loop.

In order to prove that ℓ ≥ (2^μ − 1)τ we will prove that ℓ̂ ≥ (2^μ − 1)τ in the
first loop. This is trivial by induction using the observation that (2^μ − 1)τ +
2^μ τ = (2^{μ+1} − 1)τ. □

Since ℓ ≥ (2^μ − 1)τ and μ is increased at least in every second iteration of the
first loop and decreased in every iteration of the second loop, it follows that
there are O(log(ℓ/τ)) iterations of the two first loops. The last loop takes O(τ)
time. It remains to prove that the time to evaluate the O(log(ℓ/τ)) calls to
check(i, j, 2^μ τ) sums to O(τ + log(ℓ/τ)).

Evaluating check(i, j, 2^μ τ) requires computing φ(T[i...i + 2^μ τ]) and φ(T[j...j +
2^μ τ]). The first fingerprint can be computed in constant time because i and
i + 2^μ τ are always block aligned (see Lemma 3). The time to compute the second
fingerprint depends on how far j and j + 2^μ τ each are from a sampled position,
which in turn depends inversely on the significance of the block containing those
positions. By Lemma 4, μ is always a lower bound on the significance of j, which
implies that μ also lower bounds the significance of j + 2^μ τ, and thus by the way
we sample positions, neither will have distance more than τ/2^{⌊μ/2⌋} to a
sampled position in S. Finally, note that by the way μ is increased and decreased,
check(i, j, 2^μ τ) is called at most three times for any fixed value of μ. Hence, the
total time to compute all necessary fingerprints can be bounded as

\[ O\left( \sum_{\mu=0}^{\lg(\ell/\tau)} \left( 1 + \tau / 2^{\lfloor \mu/2 \rfloor} \right) \right) = O(\tau + \log(\ell/\tau)). \]


3.5 The Las Vegas Data Structure 

We now describe an O(n^{3/2})-time and O(n/τ)-space algorithm for verifying
that φ is collision-free on all pairs of substrings of T that the query algorithm
compares. If a collision is found we pick a new φ and try again. With high
probability we can find a collision-free φ in a constant number of trials, so we
obtain the claimed Las Vegas data structure.

If τ ≤ √n we use the verification algorithm of Bille et al. [2], using O(nτ +
n log n) time and O(n/τ) space. Otherwise, we use the simple O(n^2/τ)-time and
O(n/τ)-space algorithm described below.

Recall that all fingerprint comparisons in our algorithm are of the form

\[ \phi(T[k\tau...k\tau + 2^l \tau - 1]) = \phi(T[j...j + 2^l \tau - 1]) \]

for some k ∈ [n/τ], j ∈ [n], l ∈ [log(n/τ)]. The algorithm checks each l ∈
[log(n/τ)] separately. For a fixed l it stores the fingerprints φ(T[kτ...kτ + 2^l τ])
for all k ∈ [n/τ] in a hash table H. This can be done in O(n) time and O(n/τ)
space. For every j ∈ [n] the algorithm then checks whether φ(T[j...j + 2^l τ]) ∈ H,
and if so, it verifies that the underlying two substrings are in fact the same by
comparing them character by character in O(2^l τ) time. By maintaining the
fingerprint inside a sliding window of length 2^l τ, the verification time for a fixed l
becomes O(n 2^l τ), i.e., O(n^2/τ) time for all l ∈ [log(n/τ)].

3.6 Queries with Long LCEs 

In this section we describe an O(n/τ) space data structure that in constant time
either correctly computes LCE(i, j) or determines that LCE(i, j) ≤ τ^2. The data
structure can be constructed in O(n log(n/τ)) time by a Monte Carlo or Las Vegas
algorithm.



The Data Structure Let S_τ ⊆ [1, n] be a set of positions, called the sampled
positions of T (to be defined below), and consider the sets A and B of suffixes of
T and T^R, respectively.

\[ A = \{T[i...] \mid i \in S_\tau\}, \qquad B = \{T[...i]^R \mid i \in S_\tau\}. \]

We store a data structure for A and B that allows us to perform constant time
longest common extension queries on any pair of suffixes in A or any pair in
B. This can be achieved by well-known techniques, e.g., storing a sparse suffix
tree for A and B, equipped with a nearest common ancestor data structure. To
define S_τ, let D_τ = {0, 1, ..., τ} ∪ {2τ, ..., (τ − 1)τ}, then

\[ S_\tau = \{1 \le i \le n \mid i \bmod \tau^2 \in D_\tau\}. \tag{1} \]


Answering a Query To answer an LCE query, we need the following
definitions. For i, j ∈ S_τ let LCE^R(i, j) denote the longest common prefix of
T[...i]^R ∈ B and T[...j]^R ∈ B. Moreover, for i, j ∈ [n], we define the function

\[ \delta(i, j) = (((i - j) \bmod \tau) - i) \bmod \tau^2. \tag{2} \]

We will write δ instead of δ(i, j) when i and j are clear from the context.

The following lemma gives the key property that allows us to answer a query. 


Lemma 5. For any i, j ∈ [n − τ^2], it holds that i + δ, j + δ ∈ S_τ.

Proof. Direct calculation shows that (i + δ) mod τ^2 < τ, and that (j + δ)
mod τ = 0, and thus by definition both i + δ and j + δ are in S_τ.
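The direct calculation behind Lemma 5 can be checked mechanically. The sketch below encodes the residue test of definition (1) and the function δ of definition (2); the helper names `in_sample_set` and `delta` are hypothetical, and the range restriction to [1, n] is left to the caller.

```python
def in_sample_set(i: int, tau: int) -> bool:
    """True iff i mod tau^2 lies in D_tau = {0, ..., tau} U {2*tau, ..., (tau - 1)*tau}."""
    r = i % (tau * tau)
    return r <= tau or r % tau == 0


def delta(i: int, j: int, tau: int) -> int:
    """The shift delta(i, j) = (((i - j) mod tau) - i) mod tau^2."""
    return (((i - j) % tau) - i) % (tau * tau)
```

For example, with τ = 5, i = 7 and j = 3 the shift is 22, which moves the pair to positions 29 and 25, whose residues modulo 25 are 4 and 0.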

To answer a query LCE(i, j), we first verify that i, j ∈ [n − τ^2] and that
LCE^R(i + δ, j + δ) > δ. If this is not the case, we have established that LCE(i, j) ≤
δ < τ^2, and we stop. Otherwise, we return δ + LCE(i + δ, j + δ) − 1.


Analysis To prove the correctness, suppose i, j ∈ [n − τ^2] (if not, clearly
LCE(i, j) < τ^2), then we have that i + δ, j + δ ∈ S_τ (Lemma 5). If LCE^R(i + δ, j +
δ) > δ it holds that T[i...i + δ] = T[j...j + δ] so the algorithm correctly computes
LCE(i, j) as δ + 1 + LCE(i + δ, j + δ). Conversely, if LCE^R(i + δ, j + δ) ≤ δ, then
T[i...i + δ] ≠ T[j...j + δ] and it follows that LCE(i, j) ≤ δ < τ^2.

Query time is O(1), since computing δ, LCE^R(i + δ, j + δ) and LCE(i + δ, j + δ)
all takes constant time. Storing the data structures for A and B takes space
O(|A| + |B|) = O(|S_τ|) = O(n/τ). For the preprocessing stage, we can use recent
algorithms by I et al. [8] for constructing the sparse suffix tree for A and B
in O(n/τ) space. They provide a Monte Carlo algorithm using O(n log(n/τ)) time
(correct w.h.p.), and a Las Vegas algorithm using O(n log(n/τ)) time (w.h.p.).



4 Derandomizing the Monte Carlo Data Structure 

Here we give a general technique for derandomizing Karp-Rabin fingerprints,
and apply it to our Monte Carlo algorithm. The main result is that for any
constant ε > 0, the data structure can be constructed completely deterministically
in O(n^{2+ε}) time using O(n/τ) space. Thus, compared to the probabilistic
preprocessing of the Las Vegas structure using O(n^{3/2}) time with high probability,
it is relatively cheap to derandomize the data structure completely.

Our derandomizing technique is stated in the following lemma. 

Lemma 6. Let A, L ⊆ {1, 2, ..., n} be a set of positions and lengths respectively
such that max(L) = n^{Ω(1)}. For every ε ∈ (0, 1), there exists a fingerprinting
function φ that can be evaluated in O(1/ε) time and has the property that for all
a ∈ A, l ∈ L, i ∈ {1, 2, ..., n}:

\[ \phi(T[a...a + (l - 1)]) = \phi(T[i...i + (l - 1)]) \iff T[a...a + (l - 1)] = T[i...i + (l - 1)]. \]

We can find such a φ using O(S) space and O((n^{1+ε} log n / ε^2) · max(L) · |L| · |A|/S) time,
for any value of S ∈ [1, |A|].

Proof. We let p be a prime contained in the interval [max(L) n^ε, 2 max(L) n^ε].
The idea is to choose φ to be φ = (φ_{p,x_1}, ..., φ_{p,x_k}), k = O(1/ε), where φ_{p,x_i} is
the Karp-Rabin fingerprint with parameters p and x_i.

Let Σ be the alphabet containing the characters of T. If p < |Σ| we note that
since p = n^{Ω(1)} and |Σ| = n^{O(1)} we can split each character into O(1) characters
from a smaller alphabet Σ' satisfying p > |Σ'|. So wlog. assume p > |Σ| from
now on.

We let 𝕊 be the set defined by:

\[ \mathbb{S} = \{ (T[a...a + (l - 1)], T[i...i + (l - 1)]) \mid a \in A,\ l \in L,\ i \in \{1, 2, ..., n\} \}. \]

Then we have to find φ such that φ(u) = φ(v) ⟺ u = v for all (u, v) ∈ 𝕊.

We will choose φ_{p,x_1}, φ_{p,x_2}, ..., φ_{p,x_k} successively. For any 1 ≤ l ≤ k we let
φ_{≤l} = (φ_{p,x_1}, ..., φ_{p,x_l}).

For any fingerprinting function f we denote by B(f) the number of pairs
(u, v) ∈ 𝕊 such that f(u) = f(v). We let B(id) denote the number of pairs
(u, v) ∈ 𝕊 such that u = v. Hence B(f) ≥ B(id) for every fingerprinting function
f, and f satisfies the requirement iff B(f) = B(id).

Say that we have chosen φ_{≤l} for some l < k. We now compare B(φ_{≤l+1}) =
B((φ_{p,x_{l+1}}, φ_{≤l})) to B(φ_{≤l}) when x_{l+1} is chosen uniformly at random from [p].
For any pair (u, v) ∈ 𝕊 such that u ≠ v and φ_{≤l}(u) = φ_{≤l}(v) we see that, if x_{l+1} is
chosen randomly and independently then the probability that φ_{p,x_{l+1}}(u) = φ_{p,x_{l+1}}(v)
is at most max{|u|, |v|}/p ≤ n^{-ε}. Hence

\[ \mathbb{E}_{x_{l+1}}\left( B(\phi_{\le l+1}) - B(\mathrm{id}) \right) \le \frac{B(\phi_{\le l}) - B(\mathrm{id})}{n^{\varepsilon}} \]

and thus there exists x_{l+1} ∈ {0, 1, ..., p − 1} such that

\[ B(\phi_{\le l+1}) - B(\mathrm{id}) \le \frac{B(\phi_{\le l}) - B(\mathrm{id})}{n^{\varepsilon}}. \]

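Now consider the algorithm where we construct φ successively by choosing x_{l+1}
such that B(φ_{≤l+1}) is as small as possible. Then

\[ B(\phi) - B(\mathrm{id}) = B(\phi_{\le k}) - B(\mathrm{id}) \le \frac{B(\phi_{\le k-1}) - B(\mathrm{id})}{n^{\varepsilon}} \le \cdots \le \frac{B(\phi_{\le 0}) - B(\mathrm{id})}{n^{k\varepsilon}}. \]

Since B(φ_{≤0}) ≤ n^3 and B(φ) − B(id) is an integer it suffices to choose k = ⌈4/ε⌉ in
order to conclude that B(φ) = B(id).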
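The greedy selection of x_1, x_2, ... can be illustrated on a toy instance. The sketch below is a brute-force illustration only: it materialises all pairs in 𝕊 and tries every x ∈ [p], so it does not achieve the space and time bounds of the lemma. It assumes 0-indexed positions, a prime p larger than every length in L, and characters whose codes are distinct modulo p; the names `kr`, `collisions` and `derandomize` are hypothetical.

```python
def kr(s: str, p: int, x: int) -> int:
    """Karp-Rabin fingerprint sum_k s[k] * x^k mod p of the string s."""
    return sum(ord(c) * pow(x, k, p) for k, c in enumerate(s)) % p


def collisions(pairs, p, xs) -> int:
    """B(phi): number of pairs (u, v) whose fingerprints agree for every x in xs."""
    return sum(all(kr(u, p, x) == kr(v, p, x) for x in xs) for u, v in pairs)


def derandomize(text, positions, lengths, p):
    """Greedily pick x_1, x_2, ... until only truly equal pairs collide."""
    n = len(text)
    pairs = [(text[a:a + l], text[i:i + l])
             for l in lengths for a in positions if a + l <= n
             for i in range(n - l + 1)]
    target = sum(u == v for u, v in pairs)        # B(id)
    xs = []
    # Each round strictly reduces the excess B(phi) - B(id) under the
    # assumptions stated above, so the loop terminates.
    while collisions(pairs, p, xs) > target:
        best = min(range(p), key=lambda x: collisions(pairs, p, xs + [x]))
        xs.append(best)
    return xs, pairs
```

On return, the tuple of fingerprints with parameters `xs` separates every pair of distinct substrings in the toy set, which mirrors the condition B(φ) = B(id).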

We just need to argue that for given x_1, ..., x_l we can find B(φ_{≤l}) in space
O(S) and time O(l · |L| · n log n · |A|/S). First we show how to do it in the case S =
|A|. To this end we will count the number of collisions for all (u, v) ∈ 𝕊 such that
|u| = |v| = l for some fixed l ∈ L. First we compute φ_{≤l}(T[a...a + (l − 1)]) for
all a ∈ A and store them in a multiset. Using a sliding window the fingerprints
can be computed and inserted in time O(l n log n). Now for each position i ∈ [1, n]
we compute φ_{≤l}(T[i...i + (l − 1)]) and count the number of times the fingerprint
appears in the multiset. Using a sliding window this can be done in O(l n log n)
time as well. Doing this for all l ∈ L gives the number of collisions.

Now assume that S < |A|. Then let r = ⌈|A|/S⌉ = O(|A|/S) and partition
A into A_1, ..., A_r consisting of at most S elements each. For each j = 1, 2, ..., r we
define 𝕊_j as

\[ \mathbb{S}_j = \{ (T[a...a + (l - 1)], T[i...i + (l - 1)]) \mid a \in A_j,\ l \in L,\ i \in \{1, 2, ..., n\} \}. \]

Then 𝕊_1, ..., 𝕊_r is a partition of 𝕊. For each j = 1, 2, ..., r we can find the
collisions among the elements in 𝕊_j in the manner described above. Doing
this for each j and adding the results is sufficient to get the desired space and
time complexity. □


Corollary 2. For any τ ∈ [1, n], the LCE problem can be solved by a
deterministic data structure with O(n/τ) space usage and O(τ) query time. The data
structure can be constructed in O(n^{2+ε}) time using O(n/τ) space.


Proof. We use the lemma with A = {kτ | k ∈ [n/τ]}, L = {2^l τ | l ∈ [log(n/τ)]},
S = |A| = n/τ and a suitable small constant ε > 0.